In [1]:
# Hide Code Cells
from IPython.display import HTML
HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js ">
</script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')
Out[1]:

image-2.png


Table of Contents


SECTIONS

Introduction
Our Dataset
Methodology
Code Breakdown
Conclusion
Recommendations
References


Introduction


Welcome to the world of SALITA Pro Max Ultra, the enhanced and advanced version of the original SALITA (Spoken Asian Language Identification and Tonal Analysis) project. In our team's relentless pursuit of harnessing the linguistic diversity of Asia, we have developed a state-of-the-art language classification model that takes communication to new heights. Our mission remains steadfast: to enhance communication, promote inclusivity, and enable accurate language-specific services across a wide range of applications and systems. Join us on this exciting journey as we unveil the remarkable advancements of SALITA Pro Max Ultra.

SALITA Pro Max Ultra represents a significant leap forward in language classification technology. Through the integration of deep learning and explainable AI techniques, coupled with extensive training and optimization, our model now reveals which parts of the audio it attends to when making a classification. This newfound explainability empowers us to fine-tune and optimize SALITA Pro Max Ultra, ensuring even more accurate and reliable language-specific services.

With SALITA Pro Max Ultra, we are determined to bridge language barriers like never before. By unraveling the complexities of spoken languages and understanding the audio features and gradients that our model focuses on during the classification process, we can provide seamless multilingual interactions in diverse Asian communities. Our goal is to empower individuals from different language backgrounds to communicate effortlessly, fostering a sense of unity and understanding.

SALITA Pro Max Ultra not only improves upon its predecessor but also expands the horizons of language classification. Our model provides a foundation for the development of language-specific applications and systems that cater to the unique linguistic characteristics of each language. From speech recognition to translation services, SALITA Pro Max Ultra opens up a world of possibilities, empowering businesses and individuals to communicate effectively and seamlessly across linguistic boundaries.

We invite you to embark on this exciting journey with us as we push the boundaries of language classification and enhance communication in Asia and beyond. SALITA Pro Max Ultra represents a significant advancement in our quest to bridge language gaps and foster meaningful connections. Stay tuned for further updates, discoveries, and real-world applications of SALITA Pro Max Ultra as we continue to make strides in the field of language classification and promote the power of multilingual communication.

image.png

Back to Table of Contents


Our Dataset


In the realm of language analysis and modeling, a treasure trove of linguistic diversity awaits. The Lang_data dataset, a vast collection of recordings from 14 languages spoken in Asian countries, opens the doors to a world of linguistic exploration and innovation. Sourced from the renowned platform Kaggle, this extensive dataset comprises a staggering 294,000 .wav files, collectively weighing in at a substantial 80 GB. Join us as we delve into the depths of this remarkable dataset, uncovering its immense potential for language analysis, modeling, and the development of language-specific technologies.

The Lang_data dataset showcases the linguistic tapestry of Asian countries, encompassing a rich variety of languages. From Arabic to Urdu, and everything in between, the collection includes Burmese, Chinese, Hindi, Indonesian, Japanese, Kannada, Nepali, Panjabi, Persian, Sinhala, Tamil, and Thai. With such a diverse range of languages represented, the dataset becomes a powerful tool for understanding the nuances and intricacies of each language.

image-2.png

With its extensive size and linguistic diversity, the Lang_data dataset provides an unparalleled opportunity for researchers and language experts to delve deep into the structure, patterns, and characteristics of each language. Phonetics, tonal variations, and linguistic features can be explored and analyzed, shedding light on the unique aspects of these languages. Such insights can fuel advancements in linguistic research, language modeling, and the development of accurate language-specific technologies.

The inclusion of 14 distinct languages in the Lang_data dataset paves the way for the development of a wide array of language-specific applications and services. Researchers can harness this vast resource to train language classification models, build robust speech recognition systems, and design tailored language-specific services. The scale of the dataset enables accurate analysis and supports the creation of technologies that cater to the needs of speakers from diverse linguistic backgrounds.

The Lang_data dataset is not just a compilation of audio files; it is a catalyst for innovation in communication technologies. By leveraging the rich variety of languages and the wealth of linguistic insights offered by this dataset, researchers can pave the way for advancements in machine translation, voice assistants, and other language-related applications. These technologies have the potential to bridge language barriers, foster cross-cultural understanding, and enhance communication on a global scale.

The Lang_data dataset stands as a testament to the beauty and diversity of Asian languages. Its vast collection of audio recordings provides an invaluable resource for linguistic research, language modeling, and the development of language-specific technologies. As we explore the depths of this dataset, we unveil the potential for groundbreaking advancements in communication and cross-cultural understanding. Together, let us embrace the richness of Asian languages and harness the power of the Lang_data dataset to drive innovation and enable seamless multilingual interactions.

image.png

Overall, the Lang_data dataset offers a valuable resource for studying and understanding the 14 Asian languages, enabling advancements in linguistic research, communication technologies, and language-specific applications.

Abbreviation Language
AR Arabic
FA Persian
HI Hindi
KN Kannada
NE Nepali
PA Panjabi
SI Sinhala
TA Tamil
UR Urdu
ID Indonesian
MY Malaysian
TH Thai
JA Japanese
ZH Chinese
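
For convenience, the table above can also be written as a plain Python dictionary. The mapping below is transcribed directly from the table and is shown for illustration only; the later code cells derive their own label mappings.

# Illustrative mapping of dataset abbreviations to language names (from the table above)
lang_names = {
    'AR': 'Arabic', 'FA': 'Persian', 'HI': 'Hindi', 'KN': 'Kannada',
    'NE': 'Nepali', 'PA': 'Panjabi', 'SI': 'Sinhala', 'TA': 'Tamil',
    'UR': 'Urdu', 'ID': 'Indonesian', 'MY': 'Malaysian', 'TH': 'Thai',
    'JA': 'Japanese', 'ZH': 'Chinese',
}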

Back to Table of Contents


Methodology


In our exploration of the Lang_data dataset, we embarked on a meticulous three-step journey that led to the birth of SALITA Pro Max Ultra, an exceptional language classification model. By tailoring the dataset and dataloader, we laid a strong foundation for our analysis. Through iterative refinement of the Deep Learning (DL) model, we strived to unlock its full potential, aiming to maximize accuracy and generalization. The integration of Explainable AI (XAI) added a layer of transparency and interpretability, empowering us to unravel the model's decision-making process. This holistic approach positions SALITA Pro Max Ultra as a catalyst for revolutionary advancements in multilingual communication and language-specific applications, while deepening our appreciation of the intricate tapestry of linguistic diversity.

image-2.png


Dataset and Dataloader Customization

In the realm of SALITA Pro Max Ultra's development, we began by meticulously curating the dataset. We performed a series of preprocessing steps to extract vital audio features such as spectrograms or MFCCs, which encapsulated the unique, identifying essence of each language within the dataset.

To tackle the vastness of the dataset, we implemented seamless loading and batching of audio samples, which optimized memory usage while enabling parallel processing. This dataset and dataloader customization elevated the training process, allowing SALITA Pro Max Ultra to be trained on the full linguistic diversity of the spoken Asian language corpus.
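
As a minimal sketch of this feature-extraction step, a single clip can be converted into a 40-band mel spectrogram with torchaudio, mirroring the transform used in the Code Breakdown below (the file name here is a placeholder):

import torchaudio as ta
from torchaudio import transforms

# Load up to 160,000 frames of one clip and compute its mel spectrogram
signal, sr = ta.load('example.wav', num_frames=160_000)
mel = transforms.MelSpectrogram(sr, n_mels=40)(signal).squeeze()
print(mel.shape)  # (n_mels, time_frames)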

Model Optimization

The optimization process played a crucial role in refining SALITA Pro Max Ultra to achieve its exceptional performance. From the ground up, we built the DL Classifier Model and employed various techniques to enhance its accuracy and efficiency. This involved thorough experimentation with hyperparameters, architectural modifications, and the implementation of regularization techniques.

Fine-tuning the hyperparameters was a critical step in optimizing the model. We iteratively adjusted parameters such as the learning rate, batch size, and optimizer settings to identify the optimal configuration for training. Through careful validation and experimentation, we determined the combination of hyperparameters that maximized the model's performance.
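
A hedged sketch of the kind of sweep described above is shown below. It relies on the SALITA module and AudioDataModule defined later in the Code Breakdown, and the candidate learning rates and epoch count are illustrative rather than the settings we ultimately adopted:

best_val_loss, best_lr = float('inf'), None
for lr in (1e-3, 1e-4, 1e-5):
    model = SALITA(learning_rate=lr)
    trainer = pl.Trainer(max_epochs=5, accelerator='auto', devices=1)
    trainer.fit(model, datamodule=AudioDataModule(batch_size=8))
    # pick the configuration with the lowest logged validation loss
    val_loss = trainer.callback_metrics['val_loss'].item()
    if val_loss < best_val_loss:
        best_val_loss, best_lr = val_loss, lr
print(f'Best learning rate: {best_lr}')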

Furthermore, we explored and evaluated different model layers of the convolutional neural networks (CNNs) architecture to find the most suitable one for capturing the complex language patterns within the dataset. By carefully building and adapting the architecture, we ensured that the model could effectively capture and analyze the intricate linguistic features.

Through an iterative and systematic approach to optimization, SALITA Pro Max Ultra was fine-tuned to achieve exceptional performance in language classification. The extensive experimentation and careful adjustments made during the optimization process enabled us to harness the full potential of the model and deliver outstanding results.

Evaluation and Explainable AI

Evaluation of SALITA Pro Max Ultra involved rigorous assessment of its performance and interpretability. To evaluate the model's accuracy, it was tested on an independent test set comprising unseen language samples. Accuracy was calculated to measure the model's ability to correctly classify languages.

However, SALITA Pro Max Ultra went beyond traditional evaluation metrics by incorporating XAI. XAI provides insights into the model's decision-making process, making it more interpretable and understandable. By understanding which audio features and gradients the model focused on during the classification process, the model's behavior and reasoning became transparent.

In our study on the Lang_data audio dataset, we aimed to address the challenge of explaining the behavior of deep neural networks, which are often considered black boxes. Understanding which parts of an audio file contribute to the classification output of a deep network is crucial for interpretability and transparency.

We trained another DL network called the Explainer. The Explainer is specifically trained to locate attributions for a given audio file based on the predictions made by our trained DL "black-box" classifier, which we refer to as the Explanandum.

Our approach yields masks that are more precise along the time steps and have clearly defined boundaries. This enhanced precision allows us to identify the specific regions of an audio file that contribute to the classification decision made by the black-box classifier.

One notable advantage of our approach is its capability to generate separate masks for each class label in a multi-class setting. This means that we can identify the specific areas of an audio that are relevant to each individual class, providing more detailed and class-specific explanations.

Moreover, our approach is highly efficient as it only requires a single forward pass through the Explainer network to generate the masks. This efficiency is advantageous compared to other methods that may require more computationally expensive operations.
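
Conceptually, the procedure can be pictured as a single forward pass that produces one soft mask per class, which is then thresholded. The sketch below is a hypothetical illustration only; the function name, output shape, and threshold are assumptions and do not reproduce the actual Explainer implementation:

import torch

def get_class_masks(explainer_net, spectrogram_batch, threshold=0.5):
    """Hypothetical sketch: one forward pass yields one binary mask per class."""
    with torch.no_grad():
        # assume the network maps the batch to (batch, num_classes, n_mels, time) mask logits
        mask_logits = explainer_net(spectrogram_batch)
        soft_masks = torch.sigmoid(mask_logits)
    return (soft_masks > threshold).float()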

By employing our approach, we were able to generate informative and precise masks that explain the decision-making process of the black-box classifier, shedding light on the important features and regions within the audio data that contribute to the classification outcomes.

By combining rigorous DL and XAI techniques, SALITA Pro Max Ultra not only provided exceptional performance but also enhanced interpretability, bridging the gap between AI and human understanding.

Back to Table of Contents


Code Breakdown


In the following sections, we will provide a breakdown of the code snippet and explain the purpose and functionality of each imported library and module. This will help us understand the tools and techniques used in data processing, model development, and analysis. Let's dive into the code and explore its components in detail.

Back to Table of Contents

Importing Libraries


The following code snippet imports the necessary libraries and modules, including Torch, TorchAudio, PyTorch Lightning, Pandas, Matplotlib, and pickle, along with project-specific helper modules. It sets up the environment for data processing, model training, and visualization, facilitating efficient and comprehensive analysis.

In [2]:
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchaudio as ta
from torchaudio import transforms
from torch.utils.data import DataLoader, Dataset
import sys
import pytorch_lightning as pl
from transformers import AutoFeatureExtractor, ASTForAudioClassification, Wav2Vec2Model
from torch.optim import Adam
from torchvision import models
from utils.helper import get_targets_from_annotations
from utils.metrics import SingleLabelMetrics
from IPython.display import Audio

from pathlib import Path
from models.explainer import Deeplabv3Resnet50ExplainerModel
from models.classifier import VGG16ClassifierModel, Resnet50ClassifierModel
from models.explainer_salita import ExplainerClassifierModel
from utils.image_utils import save_mask, save_masked_image, save_all_class_masks
from utils.loss import TotalVariationConv, ClassMaskAreaLoss, entropy_loss
import pandas as pd

from pyjanitor import auto_toc
toc = auto_toc()

import os, re, shutil, copy, zipfile, glob
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd

from tqdm import tqdm, trange
import time
import matplotlib as mpl
import numpy as np

import pickle
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from pickling import *
There was a problem when trying to write in your cache folder (/home/mbalogal/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-7knnau74 because the default path (/home/mbalogal/.cache/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
In [3]:
# username = 'cvillarin'
username = 'mbalogal'
# username = 'vdelossantos'
# username = 'jfabrero'
os.environ['XDG_CACHE_HOME'] = f'/home/msds2023/{username}/.cache'
os.environ['HUGGINGFACE_HUB_CACHE'] = f'/home/msds2023/{username}/.cache'

Selecting the Computing Device


Utilizing a CUDA GPU significantly reduces training/testing time, accelerating network operations. Let's check for its availability and select it as our device for maximum performance.
In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
cuda

Unpacking the Dataset


In this snippet, we extract data from a zip file and identify the classes within the dataset. The zip file, referred to as data.zip, is extracted using a simple command, allowing us to access its contents. The classes are determined by iterating through the extracted directories and capturing the names of the language folders. This step sets the stage for further analysis and exploration of the dataset, as we gain insights into the different categories or labels present in the data.

To ensure reproducibility and accessibility throughout our analysis, we place the zip file in the same working directory as this notebook.
In [5]:
zip_ = 'data.zip' # Replace with downloaded .zip from Kaggle
with zipfile.ZipFile(zip_, 'r') as zip_ref:
        zip_ref.extractall('.')
        
# Keep only the two-letter language folders (e.g., './data/AR')
classes = [x.split('/')[-1] for x in glob.glob('./data/*') if x[-3] == '/']

Renaming the Files


The rename_audio function serves the purpose of renaming audio files within each class. By iterating over the given list of classes, the function proceeds to navigate through the corresponding directory containing audio files. For each audio file in the directory, the function checks if it has a ".wav" extension and proceeds with renaming it.

The new name is generated based on the class name and an index number. This process ensures that each audio file within its respective class is uniquely identified and can be easily referenced for further analysis or processing.
In [6]:
def rename_audio(class_):
    """Renames each audio in each class"""
    for c in tqdm(class_):
        path = os.path.join('./data', c)
        for i, audio in enumerate(os.listdir(path)):
            if audio.endswith('.wav'):
                new_name = f'{c}_{i}.wav'
                os.rename(os.path.join(path, audio),
                          os.path.join(path, new_name))
In [7]:
rename_audio(classes)
100%|██████████| 14/14 [00:00<00:00, 59.33it/s]

Creating the Dataset


The create_dataset function is responsible for creating a dataset by copying audio files from a source directory (`src`) to a destination directory (`dst`) according to a specified distribution (`dist`). The function takes a list of classes (`class_`) and divides the data into training, validation, and test sets based on the distribution percentages.

The function first checks whether the destination directory exists; if it does and the overwrite flag is set to True, it removes the existing directory. Then, it iterates over each class, determines the number of data samples for each set based on the distribution, and creates the necessary directories for each stage (train, validation, test) and class.

For each class and stage, the function iterates over the specified range of samples, copies the corresponding audio file from the source directory to the destination directory, and maintains the original file naming convention. Finally, the function prints the total number of audio files for each class in each stage.

This function provides a convenient way to organize and distribute the dataset into different sets for training, validation, and testing purposes, ensuring a balanced representation of classes across the dataset splits.
In [8]:
def create_dataset(src, dst, class_, dist=(.6,.2,.2), overwrite=False):
    """Copy images of class `class_` using `dist` from src to dst.
    """
        
    if os.path.exists(dst) and overwrite:
        shutil.rmtree(dst)
    
    for c in tqdm(class_):
        c_path = os.path.join(src, c)
        n_data = len(os.listdir(c_path))
        # Turn the split fractions into cumulative index boundaries for the splits
        ns = list(map(lambda x: int(n_data*x), dist))
        ns = [0]+[x+sum(ns[:i]) for i, x in enumerate(ns)]

        
        for i, stage in enumerate(['train', 'validation', 'test']):
            stage_path = os.path.join(dst, stage)
            if not os.path.exists(stage_path):
                os.makedirs(stage_path)
            elif os.path.exists(stage_path) and overwrite == False:
                continue
                
            label_path = os.path.join(stage_path, c)
            os.makedirs(label_path)
            
            for j in range(ns[i],ns[i+1]):
                fname = f'{c}_{j}.wav'
                src_file = os.path.join(c_path, fname)
                dst_file = os.path.join(label_path, fname)
                try:
                    shutil.copyfile(src_file, dst_file)
                except FileNotFoundError:
                    # Skip indices with no matching file (gaps can appear after renaming)
                    pass
        
    for stage in ['train', 'validation', 'test']:
        for c in class_:
            label_path = os.path.join(os.path.join(dst, stage), c)
            n_data = len(os.listdir(label_path))
            print(f'Total {stage.title()} {c.title()} Audio:', f'\t{n_data}')
In [9]:
src = 'data'
dst = 'data/subset'
create_dataset(src, dst, classes, overwrite=True)
100%|██████████| 14/14 [00:01<00:00,  9.23it/s]
Total Train Ar Audio: 	14
Total Train Fa Audio: 	14
Total Train Hi Audio: 	14
Total Train Id Audio: 	11
Total Train Ja Audio: 	14
Total Train Kn Audio: 	14
Total Train My Audio: 	14
Total Train Ne Audio: 	14
Total Train Pa Audio: 	14
Total Train Si Audio: 	14
Total Train Ta Audio: 	14
Total Train Th Audio: 	14
Total Train Ur Audio: 	14
Total Train Zh Audio: 	14
Total Validation Ar Audio: 	5
Total Validation Fa Audio: 	5
Total Validation Hi Audio: 	5
Total Validation Id Audio: 	4
Total Validation Ja Audio: 	5
Total Validation Kn Audio: 	5
Total Validation My Audio: 	5
Total Validation Ne Audio: 	5
Total Validation Pa Audio: 	5
Total Validation Si Audio: 	5
Total Validation Ta Audio: 	5
Total Validation Th Audio: 	5
Total Validation Ur Audio: 	5
Total Validation Zh Audio: 	5
Total Test Ar Audio: 	5
Total Test Fa Audio: 	5
Total Test Hi Audio: 	5
Total Test Id Audio: 	4
Total Test Ja Audio: 	5
Total Test Kn Audio: 	5
Total Test My Audio: 	5
Total Test Ne Audio: 	5
Total Test Pa Audio: 	5
Total Test Si Audio: 	5
Total Test Ta Audio: 	5
Total Test Th Audio: 	5
Total Test Ur Audio: 	5
Total Test Zh Audio: 	5

Preparing the Path and Metadata


In this code snippet, we initialize directories for the training, validation, and test datasets. The audio_path variable represents the base path where the dataset subsets are located. The paths for each subset are stored in a dictionary called paths, with keys representing the subset names (train, validation, test) and values representing the corresponding directory paths.

The code then prints the directory paths for each subset using f-strings and the print function. It displays the training dataset directory, the validation dataset directory, and the test dataset directory, providing an overview of where the different subsets are located within the file system.
In [10]:
# Initialize Directories
audio_path = Path('data/subset')
paths = {x: audio_path / x for x in ['train', 'validation', 'test']}

print(f'Training Dataset Directory: \t{paths["train"]}')
print(f'Validation Dataset Directory: \t{paths["validation"]}')
print(f'Test Dataset Directory: \t{paths["test"]}')
Training Dataset Directory: 	data/subset/train
Validation Dataset Directory: 	data/subset/validation
Test Dataset Directory: 	data/subset/test
The get_annotations function is responsible for parsing audio files and extracting metadata from them. It takes two arguments: paths, which contains the directory paths for the different dataset subsets, and classes, which is a list of class names.

Within the function, there is a nested loop that iterates over the subsets and the classes. For each class, it iterates over the audio files located in the corresponding subset directory. For each audio file, it creates a dictionary containing the path of the audio file, its label (class name), and the label index (position of the class in the classes list).

These dictionaries are then appended to a list called items. After iterating through all the audio files, a pandas DataFrame is created using the items list. The DataFrame is then saved as a CSV file with the name of the current subset (stage) using the to_csv function.

Overall, this function enables the extraction of metadata from the audio files and organizes it into CSV files for each subset, providing a structured representation of the dataset for further analysis and processing.
In [11]:
def get_annotations(paths, classes=classes):
    """Parse audio files and get metadata"""
    for i, (stage, path) in enumerate(paths.items()):
        items = []
        for j, c in tqdm(enumerate(classes)):
            for audio in os.listdir(f'{path}/{c}'):
                audio_path = f'{path}/{c}/{audio}'
                
                items.append({
                    'path': audio_path,
                    'label': c,
                    'label_index': j,
                })
            
        df = pd.DataFrame(items)
        df.to_csv(f'./{stage}.csv', header=False)
        
get_annotations(paths, classes=classes)
14it [00:00, 1952.72it/s]
14it [00:00, 2255.52it/s]
14it [00:00, 2230.76it/s]

Dataset and LightningDataModule


The following code defines two classes: AudioDataset and AudioDataModule. Let's go through each class and understand their functionalities:

  1. AudioDataset:
    • This class is a custom dataset that inherits from the Dataset class provided by the PyTorch library.
    • The __init__ method initializes the dataset object. It takes meta_data (a CSV file path) and num_frames as input. It reads the CSV file using pd.read_csv and stores it in self.meta_data.
    • The __len__ method returns the length of the dataset, which is the number of rows in self.meta_data.
    • The __getitem__ method is responsible for retrieving an item from the dataset given an index.
      • It obtains the audio sample path, label, label index, and signal by calling internal helper methods (_get_audio_sample_path, _get_audio_sample_label, _get_audio_sample_label_index).
      • It loads the audio sample using ta.load and applies a mel-spectrogram transformation using transforms.MelSpectrogram from the torchaudio library.
      • The resulting mel spectrogram (stored in the variable mfcc) is returned along with the label, label index, signal, and audio sample path.



2. AudioDataModule:

  • This class is a PyTorch Lightning LightningDataModule that handles the data loading and processing for the audio dataset.
  • The __init__ method initializes the data module object. It takes parameters such as batch_size, num_workers, and pin_memory.
  • The setup method is used to define the datasets for different stages (train, validation, test). It creates instances of the AudioDataset class for each stage.
  • The pad_sequence method pads the sequences in a batch with zeros to make them the same length.
  • The collate_fn method is a custom collate function used by the DataLoader. It gathers tensors, targets, and file paths from the batch and returns them in a formatted way.
  • The train_dataloader, val_dataloader, and test_dataloader methods return DataLoader objects for the respective stages, with appropriate settings such as batch size, shuffling, collate function, and number of workers.

These classes can be used to handle audio data loading, preprocessing, and batching in a PyTorch-based deep learning project. The AudioDataModule provides an organized and standardized way to define and access data loaders for different stages of the training process.

In [12]:
class AudioDataset(Dataset):
    def __init__(self, meta_data, num_frames=160_000):
        self.num_frames = num_frames
        self.meta_data = pd.read_csv(meta_data, header=None, index_col=0)

    def __len__(self):
        return len(self.meta_data)

    def __getitem__(self, index):
        # Edited
        audio_sample_path = self._get_audio_sample_path(index)
        label = self._get_audio_sample_label(index)
        label_index = self._get_audio_sample_label_index(index)
        signal, sr = ta.load(audio_sample_path, num_frames=self.num_frames)
        transform = transforms.MelSpectrogram(sr, n_mels=40)
        mfcc = transform(signal).squeeze()
        
        return mfcc, label, label_index, signal, audio_sample_path

    def _get_audio_sample_path(self, index):
        path = self.meta_data.iloc[index, 0]
        path = os.path.join(os.getcwd(),path)
        return path

    def _get_audio_sample_label(self, index):
        return self.meta_data.iloc[index, 1]
    
    def _get_audio_sample_label_index(self, index):
        return self.meta_data.iloc[index, 2]
    
class AudioDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=256, num_workers=0, pin_memory=True):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.pin_memory = pin_memory
        self.datasets = {}
        self.dataloaders = {}

    def setup(self, stage=None):
        stages = ['train', 'validation', 'test']

        # Define your datasets
        self.datasets = {
            # Edited
            x: AudioDataset(f'{x}.csv')
            for x in stages
        }

    def pad_sequence(self, batch):
        # Make all tensors in a batch the same length by padding with zeros
        batch = [item.t() for item in batch]
        batch = torch.nn.utils.rnn.pad_sequence(batch,
                                                batch_first=True,
                                                padding_value=0.)
        return batch.permute(0, 2, 1)

    def collate_fn(self, batch):
        tensors, targets, paths = [], [], []

        # Gather tensors and encode labels as indices
        for mel, _, label_index, _, filepath in batch:
            tensors += [mel]
            targets += [torch.tensor(label_index)]
            paths += [filepath]

        # Group the list of tensors into a batched tensor
        tensors = self.pad_sequence(tensors)
        targets = torch.stack(targets)

        return tensors, targets, paths

    def train_dataloader(self):
        return DataLoader(
            self.datasets['train'],
            batch_size=self.batch_size,
            shuffle=True,
            collate_fn=self.collate_fn,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
        )

    def val_dataloader(self):
        return DataLoader(
            self.datasets['validation'],
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=self.collate_fn,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
        )

    def test_dataloader(self):
        return DataLoader(
            self.datasets['test'],
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=self.collate_fn,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
        )

Building the DL Classifier (SALITA Pro Max Ultra)


The next code defines a PyTorch Lightning module named SALITA. Let's go through its main components:

  1. __init__ method:
    • Initializes the SALITA module with parameters such as num_classes, dataset, learning_rate, and metrics_threshold.
    • Calls other setup methods to initialize the model, losses, and metrics.



2. setup_model method:

  • Sets up the architecture of the model.
  • It defines a convolutional neural network (CNN) with several convolutional layers (Conv1d) followed by max pooling (MaxPool1d), and fully connected layers (Linear).
  • The number of classes is determined by num_classes.



3. setup_losses method:

  • Sets up the loss function for the model.
  • In this case, it uses the cross-entropy loss (CrossEntropyLoss) for multiclass classification.



4. forward method:

  • Implements the forward pass of the model.
  • The input x is passed through the convolutional layers with ReLU activations and max pooling operations.
  • The output is then flattened and passed through fully connected layers with ReLU activations.
  • The final output is passed through a log-softmax (F.log_softmax) to obtain log class probabilities.



5. setup_metrics method:

  • Sets up metrics for training, validation, and testing.
  • It uses a custom metric class named SingleLabelMetrics initialized with the number of classes.



6. training_step, validation_step, and test_step methods:

  • Define the operations performed during a single step of the respective stages (training, validation, testing).
  • They compute the model's output, calculate the loss, and update the metrics accordingly.
  • The training and test steps also calculate accuracy by comparing the predicted labels with the true labels.



7. configure_optimizers method:

  • Specifies the optimizer used for training.
  • In this case, it uses the Adam optimizer (Adam) with the specified learning rate.



8. on_test_epoch_end method:

  • Executed at the end of the testing epoch.
  • Computes the test metrics, saves them, and resets the metrics for future calculations.
  • The test metrics are logged, including the accuracy, and saved as instance attributes.

The SALITA module is designed for audio classification tasks using a CNN architecture. It provides methods for training, validation, testing, and configuring the optimizer.

In [13]:
class SALITA(pl.LightningModule):
    def __init__(self,
                 num_classes=14,
                 dataset="lang_data",
                 learning_rate=1e-5,
                 metrics_threshold=0.0):
        super().__init__()

        self.setup_model(num_classes)
        self.setup_losses()
        self.setup_metrics(num_classes=num_classes)

        self.num_classes = num_classes
        self.learning_rate = learning_rate
        self.dataset = dataset

    def setup_model(self, num_classes):
        self.conv1 = nn.Conv1d(40, 64, kernel_size=3, stride=1)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=3, stride=1)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=3, stride=1)
        self.conv4 = nn.Conv1d(256, 512, kernel_size=3, stride=1)
        self.conv5 = nn.Conv1d(512, 1024, kernel_size=3, stride=1)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(1024 * 23, 512)
        self.fc2 = nn.Linear(512, num_classes)
        
        # Load parameters from .pth file
        pretrained_file = "final_model_checkpoint.pth" 
#         pretrained_file = "/mnt/processed/private/msds2023/cpt8/ml3_project/saves/epoch10_model.pth" #Edited
        if os.path.isfile(pretrained_file):
            state_dict = torch.load(pretrained_file)
            self.load_state_dict(state_dict)

    def setup_losses(self):
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool(x)

        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool(x)

        x = self.conv3(x)
        x = F.relu(x)
        x = self.pool(x)

        x = self.conv4(x)
        x = F.relu(x)
        x = self.pool(x)

        x = self.conv5(x)
        x = F.relu(x)
        x = self.pool(x)

        x = x.view(x.size(0), -1)

        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)

        return F.log_softmax(x, dim=1)
    
    def setup_metrics(self, num_classes):
        self.train_metrics = SingleLabelMetrics(num_classes=num_classes)
        self.valid_metrics = SingleLabelMetrics(num_classes=num_classes)
        self.test_metrics = SingleLabelMetrics(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y, _ = batch
        logits = self(x)
        
        preds = torch.argmax(logits, dim=1)
        accuracy = (preds == y).sum().item() / len(y)

        loss = self.loss_fn(logits, y)
        self.log('train_loss', loss)
        self.log('train_accuracy', accuracy)
        self.train_metrics(logits, y)

        # Return the loss so Lightning can run the backward pass
        return loss

    def validation_step(self, batch, batch_idx):
        x, y, _ = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)
        self.log('val_loss', loss)
        self.valid_metrics(logits, y)
        
    def test_step(self, batch, batch_idx):
        x, y, _ = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)

        # Calculate accuracy
        preds = torch.argmax(logits, dim=1)
        accuracy = (preds == y).sum().item() / len(y)

        self.log('test_loss', loss)
        self.log('test_accuracy', accuracy)
        self.test_metrics(logits, y)
        
    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

        
    def on_test_epoch_end(self):
        test_metrics = self.test_metrics.compute()
        self.log('test_metrics', test_metrics, prog_bar=True)
        self.test_metrics.save(model="classifier", classifier_type="SALITA",
                               dataset=self.dataset)
        self.test_metrics.reset()

        # Save the test metrics as instance attributes
        self.test_metrics_results = test_metrics
In [14]:
batch_size = 8

if device.type == "cuda":
    num_workers = 1
    pin_memory = True
else:
    num_workers = 0
    pin_memory = False

data_module = AudioDataModule(batch_size=batch_size,
                              num_workers=num_workers,
                              pin_memory=pin_memory)
In [15]:
model = SALITA()
model.to(device)
Out[15]:
SALITA(
  (conv1): Conv1d(40, 64, kernel_size=(3,), stride=(1,))
  (conv2): Conv1d(64, 128, kernel_size=(3,), stride=(1,))
  (conv3): Conv1d(128, 256, kernel_size=(3,), stride=(1,))
  (conv4): Conv1d(256, 512, kernel_size=(3,), stride=(1,))
  (conv5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,))
  (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=23552, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=14, bias=True)
  (loss_fn): CrossEntropyLoss()
  (train_metrics): SingleLabelMetrics()
  (valid_metrics): SingleLabelMetrics()
  (test_metrics): SingleLabelMetrics()
)

The code snippet loads two pickled arrays from files: test_cm and train_cm. Pickling is a way to serialize Python objects into a binary format that can be easily stored and retrieved. In this case, the arrays were likely serialized and stored using the pickle module.

The first block of code opens the file named 'test_cm.pickle' in binary mode ('rb') and uses the pickle.load() function to deserialize and load the contents of the file into the test_cm variable. Similarly, the second block of code opens the file named 'train_cm.pickle' and loads its contents into the train_cm variable.

After executing these lines, test_cm and train_cm will contain the data that was previously serialized and stored in the pickle files.
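
For context, the toy snippet below shows how such a pickle can be produced in the first place; the arrays and file name are illustrative and are not the ones used to create test_cm.pickle and train_cm.pickle.

import pickle
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy example: compute a confusion matrix from dummy labels and serialize it
y_true = np.array([0, 1, 2, 1])
y_pred = np.array([0, 1, 1, 1])
cm = confusion_matrix(y_true, y_pred)
with open('example_cm.pickle', 'wb') as file:
    pickle.dump(cm, file)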

In [16]:
# Unpickle the array
with open('test_cm.pickle', 'rb') as file:
    test_cm = pickle.load(file)

# Unpickle the array
with open('train_cm.pickle', 'rb') as file:
    train_cm = pickle.load(file)

Displaying the Confusion Matrices

The code snippet below creates a ConfusionMatrixDisplay object disp using the train_cm and test_cm arrays, which are confusion matrices of the train and test sets.

The plot() method of the disp object is then called to generate a plot of the confusion matrix. The plot visually represents the performance of a classification model by showing the counts or proportions of correct and incorrect predictions for each class.

Finally, plt.show() is called to display the generated plot. This function is typically used in conjunction with the Matplotlib library to show the figures or plots created using its plotting functions.

In [17]:
disp = ConfusionMatrixDisplay(train_cm)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(cmap='BuPu', ax=ax)
plt.show()
In [18]:
disp = ConfusionMatrixDisplay(test_cm)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(cmap='BuPu', ax=ax)
plt.show()

Copying Files and Directories


These are commands used to copy files and directories. Let's break them down:

  1. !cp -r "./NN-Explainer/src/utils" .
    • cp is a command in Unix-like systems used to copy files and directories.
    • -r is an option that allows recursive copying, meaning it copies directories and their contents.
    • "./NN-Explainer/src/utils" specifies the source directory that we want to copy.
    • . represents the current directory, indicating the destination where the files and directories will be copied to.

This command copies the "utils" directory from the "NN-Explainer" repository's source directory to our current directory.

  2. !cp -r "./NN-Explainer/src/models" .

    • This command is similar to the previous one but copies the "models" directory from the "NN-Explainer" repository's source directory to our current directory.
  3. !cp ./explainer_salita.py ./models

    • Here, cp is used to copy a single file.
    • "./explainer_salita.py" is the source file we want to copy.
    • "./models" represents the destination directory where we want to copy the file.

This command copies the file "explainer_salita.py" to the "models" directory.

Overall, these commands are used to copy directories and files from the "NN-Explainer" repository to the current directory, allowing us to use or modify them locally.

In [19]:
!cp -r "./NN-Explainer/src/utils" .
!cp -r "./NN-Explainer/src/models" .
!cp ./explainer_salita.py ./models

Cloning the NN Explainer


The command git clone https://github.com/stevenstalder/NN-Explainer.git is used to make a copy of a GitHub repository called "NN-Explainer" by the user "stevenstalder." It allows us to access the repository's files and contribute to the project if permitted.
In [20]:
# !git clone https://github.com/stevenstalder/NN-Explainer.git

Integrating XAI


We initialize an ExplainerClassifierModel, set it to evaluation mode, prepare the necessary data module for testing, and create a directory to save the explainer model's masks based on the dataset, classifier type, and mode.

The code snippet performs the following steps:

  1. Instantiates an ExplainerClassifierModel object named explainer and moves it to the specified device.
  2. Sets the explainer model to evaluation mode.
  3. Defines the number of classes as 14.
  4. Creates an AudioDataModule object named data_module.
  5. Calls the setup method of data_module with the stage set to "test", which sets up the test dataset.
  6. Defines the dataset variable as "lang_data".
  7. Defines the classifier_type variable as "SALITA".
  8. Defines the mode variable as "seg".
  9. Creates a save_path object as a Path based on the dataset, classifier type, and "explainer" mode.
  10. Checks if the directory specified by save_path exists, and if not, creates the directory using os.makedirs().
In [21]:
explainer = ExplainerClassifierModel(classifier=model).to(device)
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
In [22]:
explainer.eval()
Out[22]:
ExplainerClassifierModel(
  (explainer): Deeplabv3Resnet50ExplainerModel(
    (explainer): DeepLabV3(
      (backbone): IntermediateLayerGetter(
        (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
        (layer1): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
        )
        (layer2): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (3): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
        )
        (layer3): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (3): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (4): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (5): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
        )
        (layer4): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
            (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
            (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
            (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (relu): ReLU(inplace=True)
          )
        )
      )
      (classifier): DeepLabHead(
        (0): ASPP(
          (convs): ModuleList(
            (0): Sequential(
              (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (2): ReLU()
            )
            (1): ASPPConv(
              (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False)
              (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (2): ReLU()
            )
            (2): ASPPConv(
              (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(24, 24), dilation=(24, 24), bias=False)
              (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (2): ReLU()
            )
            (3): ASPPConv(
              (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(36, 36), dilation=(36, 36), bias=False)
              (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (2): ReLU()
            )
            (4): ASPPPooling(
              (0): AdaptiveAvgPool2d(output_size=1)
              (1): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (3): ReLU()
            )
          )
          (project): Sequential(
            (0): Conv2d(1280, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): ReLU()
            (3): Dropout(p=0.5, inplace=False)
          )
        )
        (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): ReLU()
        (4): Conv2d(256, 14, kernel_size=(1, 1), stride=(1, 1))
      )
    )
  )
  (classifier): SALITA(
    (conv1): Conv1d(40, 64, kernel_size=(3,), stride=(1,))
    (conv2): Conv1d(64, 128, kernel_size=(3,), stride=(1,))
    (conv3): Conv1d(128, 256, kernel_size=(3,), stride=(1,))
    (conv4): Conv1d(256, 512, kernel_size=(3,), stride=(1,))
    (conv5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,))
    (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (fc1): Linear(in_features=23552, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=14, bias=True)
    (loss_fn): CrossEntropyLoss()
    (train_metrics): SingleLabelMetrics()
    (valid_metrics): SingleLabelMetrics()
    (test_metrics): SingleLabelMetrics()
  )
  (total_variation_conv): TotalVariationConv(
    (variance_right_filter): Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, padding_mode=reflect)
    (variance_down_filter): Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, padding_mode=reflect)
  )
  (classification_loss_fn): CrossEntropyLoss()
  (train_metrics): SingleLabelMetrics()
  (valid_metrics): SingleLabelMetrics()
  (test_metrics): SingleLabelMetrics()
)
In [23]:
num_classes = 14
# data_base_path = '../../datasets/'
# data_path = Path(data_base_path) / "lang_data"
data_module = AudioDataModule()
data_module.setup(stage = "test")
dataset = "lang_data"
classifier_type = "SALITA"
mode = "seg"
In [24]:
save_path = Path('masks/{}_{}_{}/'.format(dataset, classifier_type, "explainer"))
if not os.path.isdir(save_path):
    os.makedirs(save_path)
In [25]:
i2l_dict = {
    0.0: 'HI',
    1.0: 'NE',
    2.0: 'TH',
    3.0: 'SI',
    4.0: 'JA',
    5.0: 'PA',
    6.0: 'AR',
    7.0: 'TA',
    8.0: 'KN',
    9.0: 'FA',
    10.0: 'MY',
    11.0: 'ZH',
    12.0: 'UR',
    13.0: 'ID',
}

l2i_dict = {y: x for x, y in i2l_dict.items()}
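The inverted dictionary lets the notebook convert a language code to its class index and back. A trivial illustration (not part of the pipeline):

l2i_dict['JA']    # -> 4.0
i2l_dict[4.0]     # -> 'JA'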

Loading Audio Samples


The following code loads an audio sample, computes its Mel spectrogram, prepares the input tensor for the explainer model by repeating the spectrogram, and obtains the model's predicted mask and prediction for the input.

The code performs the following operations:
1. Sets the audio_sample_path variable to the path of an audio sample file named "Luffy_cut.wav".
2. Sets the label variable to 'JA'.
3. Retrieves the corresponding label_index from a dictionary l2i_dict based on the label.
4. Defines num_frames as 160,000 and n_mels as 40.
5. Loads the audio file at audio_sample_path using torchaudio's load function, limiting it to num_frames frames. It returns the audio signal as signal and the sample rate as sr.
6. Creates a Mel spectrogram transformation object using n_mels as the number of mel filter banks.
7. Applies the Mel spectrogram transformation to the signal using the transform object. The resulting spectrogram is stored in mfcc after squeezing it to remove any single-dimensional axes.

The code then continues with the following:
1. Stacks three copies of the mfcc spectrogram along a new leading dimension via mfcc.repeat(3, 1, 1), producing a three-channel tensor x (likely to match the three-channel input expected by the explainer's image backbone).
2. Moves the x tensor to the specified device.
3. Creates a tensor y containing the label index, and moves it to the specified device.
4. Extracts the filename from audio_sample_path using the rsplit() method.
5. Passes the x and y tensors to the explainer model, capturing only the returned saliency mask and discarding the other outputs; the class prediction predict is then obtained separately by calling the explainer's underlying classifier on x.
In [26]:
audio_sample_path = 'Luffy_cut.wav'
label = 'JA'
label_index = l2i_dict[label]
num_frames = 160_000
n_mels = 40
signal, sr = ta.load(audio_sample_path, num_frames=num_frames)
transform = transforms.MelSpectrogram(sr, n_mels=n_mels)
mfcc = transform(signal).squeeze()

x = mfcc.repeat(3, 1, 1)
x = x.to(device)
y = torch.tensor(label_index).to(device)
filename = audio_sample_path.rsplit('/', 1)[-1]

_, _, mask, _, _ = explainer(x, y)
predict = explainer.classifier(x)
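As a quick, optional sanity check (a minimal sketch, assuming predict holds one row of class logits per repeated channel copy), the raw prediction can be translated back into a language code with the i2l_dict defined earlier:

# Hypothetical check, not part of the original pipeline
pred_idx = predict.argmax(dim=1)[0].item()   # most probable class for the first channel copy
pred_label = i2l_dict[pred_idx]              # integer and float keys compare equal, so the lookup works
print(f'Predicted: {pred_label}, ground truth: {label}')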

The plot_waveform function takes an audio waveform, truncates or zero-pads it to the desired number of frames, and plots it with matplotlib under the specified title. It also returns the padded array so that later cells can reuse it.

In [27]:
def plot_waveform(waveform, sr, num_frames, title="Waveform"):
    """Plot an audio waveform truncated or zero-padded to `num_frames` samples."""
    waveform = waveform.numpy()
    n_channels, n_frames = waveform.shape
    # Keep only the first channel of multi-channel audio
    if n_channels > 1:
        waveform = waveform[:1]

    # Truncate long signals; zero-pad short ones to exactly `num_frames` samples
    if n_frames > num_frames:
        padded = waveform[:, :num_frames]
    else:
        padded = np.zeros((1, num_frames))
        padded[:, :n_frames] = waveform

    time_axis = torch.arange(0, num_frames) / sr

    figure, axes = plt.subplots(1, 1, figsize=(15, 6))
    axes.plot(time_axis, padded[0], linewidth=1, c='k')
    axes.axis('off')
    figure.suptitle(title)
    plt.show(block=False)

    return padded
In [28]:
fig, ax = plt.subplots(1, 2, figsize=(15, 4))
# Take the first of the three identical channel copies, shape (40, 801)
sns.heatmap(torch.log(x.view(-1, 40, 801)[0] + 1e-3).cpu(),
            cmap='PuRd',
            cbar=False,
            ax=ax[0])
sns.heatmap(mask.view(-1, 40, 801)[0].cpu(),
            cmap='gray',
            cbar=False,
            ax=ax[1])
ax[0].axis('off')
ax[0].set_title('MFCC of the Sample Audio File')
ax[1].axis('off')
ax[1].set_title('Saliency Mask of the Sample Audio File')
toc.add_fig('MFCC Representation - Sample', width=100)
Figure 1. MFCC Representation - Sample.

Defining Utility Functions for Object Manipulation and Visualization

Let's go through each function:

  1. erode(image, selem, n=1): Performs erosion on the input image using the structuring element selem. The erosion operation shrinks the bright regions in the image. It can be applied multiple times by specifying the parameter n. The function returns the eroded image.

  2. dilate(image, selem, n=1): Performs dilation on the input image using the structuring element selem. The dilation operation expands the bright regions in the image. It can be applied multiple times by specifying the parameter n. The function returns the dilated image.

  3. n_close(image, selem, n=1): Performs closing on the input image using the structuring element selem. Closing is the combination of dilation followed by erosion and is useful for closing small gaps or holes in the bright regions of the image. It can be applied multiple times by specifying the parameter n. The function returns the closed image.

  4. n_open(image, selem, n=1): Performs opening on the input image using the structuring element selem. Opening is the combination of erosion followed by dilation and is useful for removing small bright regions or smoothing the edges of bright regions in the image. It can be applied multiple times by specifying the parameter n. The function returns the opened image.

  5. plot_waveform(waveform, sr, num_frames, title="Waveform"): Takes an audio waveform represented by the waveform tensor, the sample rate sr, and the desired number of frames num_frames. It plots the waveform using matplotlib, ensuring that it has the specified number of frames. The resulting plot is displayed, and the padded waveform is returned.

  6. colorFader(c1, c2, mix=0): Performs linear interpolation between two colors c1 and c2 based on the mix parameter (0 to 1). It returns the interpolated color in hexadecimal format.

In [29]:
from skimage.morphology import erosion, dilation, opening, closing
def erode(image, selem, n=1):
    """Perform erosion `n` times"""
    for _ in range(n):
        image = erosion(image, selem)
    
    return image


def dilate(image, selem, n=1):
    """Perform dilation `n` times"""
    for _ in range(n):
        image = dilation(image, selem)
    
    return image


def n_close(image, selem, n=1):
    """Perform dilation `n` times"""
    for _ in range(n):
        image = closing(image, selem)
    
    return image


def n_open(image, selem, n=1):
    """Perform dilation `n` times"""
    for _ in range(n):
        image = opening(image, selem)
    
    return image


def plot_waveform(waveform, sr, num_frames, title="Waveform"):
    """Same as above, but registers the figure with the notebook's `toc` helper instead of calling plt.show."""
    waveform = waveform.numpy()
    n_channels, n_frames = waveform.shape
    # Keep only the first channel of multi-channel audio
    if n_channels > 1:
        waveform = waveform[:1]

    # Truncate long signals; zero-pad short ones to exactly `num_frames` samples
    if n_frames > num_frames:
        padded = waveform[:, :num_frames]
    else:
        padded = np.zeros((1, num_frames))
        padded[:, :n_frames] = waveform

    time_axis = torch.arange(0, num_frames) / sr

    figure, axes = plt.subplots(1, 1, figsize=(15, 6))
    axes.plot(time_axis, padded[0], linewidth=1, c='k')
    axes.axis('off')
    figure.suptitle(title)
    toc.add_fig('Audio File Visualization')

    return padded


def colorFader(c1,c2,mix=0): #fade (linear interpolate) from color c1 (at mix=0) to c2 (mix=1)
    c1=np.array(mpl.colors.to_rgb(c1))
    c2=np.array(mpl.colors.to_rgb(c2))
    return mpl.colors.to_hex((1-mix)*c1 + mix*c2)
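As a small illustration (not part of the pipeline), the fader maps mix values in [0, 1] onto a white-to-purple gradient; this is what later shades each vertical line in the importance plot:

for mix in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(mix, colorFader('white', 'purple', mix))   # '#ffffff' at mix=0.0 up to '#800080' at mix=1.0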

Setting Mask and Waveform Data Operations


The next code snippet performs various operations on the input mask and waveform data. Here's a breakdown of the steps:

  1. waveform, sr = ta.load(audio_sample_path): Loads an audio waveform from the specified file path using the ta.load function, assigning the waveform and sample rate to waveform and sr, respectively. The waveform is then passed to plot_waveform, which truncates or zero-pads it to num_frames samples, plots it, and returns the padded array as padded_waveform.

  2. plot_mask = mask + mask.min(): Offsets every element of the mask tensor by the tensor's own minimum value before aggregation.

  3. plot_mask = (plot_mask.sum(1) / plot_mask.sum(1).max()).mean(0): Sums the shifted mask over the frequency (mel-bin) dimension, scales the result by its maximum value, and averages over the channel copies, yielding one importance weight per time frame.

  4. thick_mask = (plot_mask > plot_mask.quantile(.75)).float(): Creates a binary mask tensor thick_mask by thresholding the plot_mask tensor at the 75th percentile value, converting it to a float tensor.

  5. wave_mask = plot_mask[:-1].view(1, -1).t().repeat(1, 200).view(1, -1).cpu().numpy(): Drops the last frame of plot_mask, repeats each of the remaining 800 frame weights 200 times so that they stretch to the 160,000-sample length of the clip, moves the tensor to the CPU, and converts it to a NumPy array.

  6. closed_mask = erode(thick_mask.repeat(3, 1).cpu().numpy(), np.array([[0,0,0],[1,1,1],[0,0,0]]), 5): Tiles the binary thick_mask into three rows and erodes it five times with a structuring element whose active pixels form a horizontal line, so the erosion acts only along the time axis and removes isolated salient frames.

  7. closed_mask = dilate(closed_mask, np.array([[0,0,0],[1,1,1],[0,0,0]]), 5): Dilates the eroded mask five times with the same horizontal structuring element, restoring the extent of the surviving salient regions.

  8. closed_mask = n_close(closed_mask, np.array([[0,0,0],[1,1,1],[0,0,0]]), 5): Applies the closing operation five times with the same structuring element, filling small gaps between neighbouring salient regions.

  9. wave_top_mask = np.repeat(closed_mask[0, :-1].reshape(1, -1).T, 200, 1).reshape(1, -1): Takes the first row of closed_mask (excluding the last frame), repeats each frame 200 times, and reshapes the result into a sample-level binary mask of shape (1, 160,000) named wave_top_mask.

In [30]:
waveform, sr = ta.load(audio_sample_path)
padded_waveform = plot_waveform(waveform, sr, num_frames, title="Original waveform")
plot_mask = mask + mask.min()
plot_mask = (plot_mask.sum(1) / plot_mask.sum(1).max()).mean(0)
thick_mask = (plot_mask > plot_mask.quantile(.75)).float()
wave_mask = plot_mask[:-1].view(1, -1).t().repeat(1, 200).view(1, -1).cpu().numpy()

closed_mask = erode(thick_mask.repeat(3, 1).cpu().numpy(),
                      np.array([[0,0,0],
                                [1,1,1],
                                [0,0,0]]),
                      5)
closed_mask = dilate(closed_mask,
                      np.array([[0,0,0],
                                [1,1,1],
                                [0,0,0]]),
                      5)
closed_mask = n_close(closed_mask,
                      np.array([[0,0,0],
                                [1,1,1],
                                [0,0,0]]),
                      5)

wave_top_mask = np.repeat(closed_mask[0, :-1].reshape(1, -1).T, 200, 1).reshape(1, -1)
Figure 2. Audio File Visualization.

Visualization of Waveform with Importance Gradients


The next code generates a figure with two subplots.

In the first subplot (ax[0]), it plots the original waveform using the padded_waveform data. The waveform is represented by a black line.

In the second subplot (ax[1]), it visualizes the waveform with importance gradients. It iterates over n timesteps and adds vertical lines (axvline) at each timestep using different colors based on the gradient values derived from the closed_mask data. The colors of the vertical lines are interpolated between 'white' (c1) and 'purple' (c2) using the colorFader function.

Both subplots have their axes turned off, and each subplot is given its own title.

The final call, toc.add_fig(...), is not part of the plot itself; it registers the figure with the notebook's table-of-contents helper so that it can be referenced later.
In [31]:
c1='white'
c2='purple'
n=801
    
fig, ax = plt.subplots(1, 2, figsize=(15, 2))
for i in range(n):
    ax[1].axvline(i, color=colorFader(c1,c2,closed_mask[0, i]), linewidth=4)

time_axis = torch.arange(0, 160_000) / 200  # rescale sample indices to frame indices (0-800) so the waveform lines up with the axvline positions

for ai in ax:
    ai.plot(time_axis, padded_waveform[0], linewidth=1, c='k')
    ai.axis('off')

ax[0].set_title('Original Waveform')
ax[1].set_title('Waveform with Importance Gradients')
toc.add_fig('Sample Timestep Importance Identification', width=100)
Figure 3. Sample Timestep Importance Identification.

The two audio players below compare the original clip with its masked counterpart: the first renders the padded waveform as-is, while the second multiplies it by wave_top_mask so that only the timesteps the explainer marked as salient remain audible.
In [32]:
Audio(padded_waveform, rate=sr)
Out[32]:
In [33]:
Audio(padded_waveform*wave_top_mask, rate=sr)
Out[33]:

Back to Table of Contents


Conclusion


Throughout the project we faced a significant challenge: the size of the dataset relative to our available computing resources. Despite persistent efforts to improve performance and mitigate overfitting, we eventually hit the limits of our hardware, which constrained how far the model could be trained and tuned.

However, we strategically focused our efforts on leveraging the two-dimensional Mel-frequency cepstral coefficient (MFCC) representation of the audio waves. To achieve explainability, we employed a segmentation-based explainer model (the DeepLabV3-style network shown in the architecture printout above) that identifies and assigns saliency to the input matrix. By decomposing these saliency values back into the wave-signal domain, we obtained gradient values for each time step, indicating the importance of each window in the audio signal. Through careful aggregation and manipulation of these gradients, we captured the overall, wide-scale allocation of attention within our model.

The original audio waves were subsequently masked to filter out unimportant information, retaining only the sound signals found in the relevant timesteps as indicated by the aggregated gradients. This process perfectly encapsulates our approach: what you hear is what you classify.
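In code terms, the essence of that masking step can be condensed into a few lines (a sketch reusing the notebook's variable names; the frame count, the 200-sample hop, and the mask shape of (3, 40, 801) are assumptions carried over from the cells above, and the morphological smoothing is omitted for brevity):

# Collapse the saliency mask to one importance weight per spectrogram frame
frame_weights = (mask + mask.min()).sum(1)
frame_weights = (frame_weights / frame_weights.max()).mean(0)     # shape: (801,)

# Keep only the most salient quarter of the frames
keep = (frame_weights > frame_weights.quantile(0.75)).float()

# Stretch the frame-level decision back to sample level and silence everything else
sample_mask = keep[:-1].repeat_interleave(200).cpu().numpy()      # 800 frames x 200 samples
masked_audio = padded_waveform[:, :sample_mask.shape[0]] * sample_mask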

Although our SALITA model did not perform as well as we had initially hoped, we made a deliberate trade-off. The explainability of our model was made possible by training separate networks on the two-dimensional representations of the audio signals. This empowers us to understand and interpret the decision-making process of our model, shedding light on the features and patterns within the audio data that contribute to the classification outcomes.

While our model's performance may have reached its limits, the insights and explanations gained through our innovative approach pave the way for further advancements and breakthroughs in the field of audio signal analysis and classification. By combining the power of audio representations with explainability techniques, we open doors to new possibilities and foster a deeper understanding of complex audio data.

Back to Table of Contents


Recommendations


For future studies on the dataset, there are several promising directions that can further enhance the understanding and utilization of the SALITA Pro Max Ultra model. These recommendations aim to expand the scope of analysis and improve the applicability of the model in real-world scenarios:
  • One intriguing area to explore is emotion detection. Emotions play a significant role in communication and can greatly influence the way languages are spoken and perceived. By incorporating emotion detection into the language classification framework, we can gain deeper insights into the emotional aspects of spoken language. This could involve training models to recognize and classify different emotional states conveyed through speech, such as happiness, sadness, anger, or surprise. This extension would enable SALITA Pro Max Ultra to not only identify languages but also capture and interpret the emotional nuances embedded within speech.

  • Expanding the dataset to include more languages is another valuable recommendation. The current dataset encompasses 14 languages spoken in Asian countries, providing a solid foundation for language classification. However, the inclusion of additional languages from different regions and linguistic backgrounds would further enrich the dataset and increase its diversity. Incorporating languages from various continents, such as African, European, or South American languages, would enhance the generalizability and versatility of the model, making it applicable to a wider range of linguistic contexts.

  • Integrating the SALITA Pro Max Ultra models with auto-captioners would significantly enhance the accessibility and usability of the system. Auto-captioners automatically generate captions or subtitles for audio or video content, facilitating comprehension for individuals with hearing impairments or those in noisy environments. By integrating the language classification models with auto-captioning technology, we can provide accurate and real-time captions that are tailored to the specific language being spoken. This integration would enable seamless communication and improve accessibility across different platforms, including online videos, live events, and teleconferences.

  • Lastly, validating the findings of SALITA Pro Max Ultra with sound experts is essential to ensure the accuracy and reliability of the model. Collaborating with experts in phonetics, linguistics, and acoustic analysis can help validate the effectiveness of the model in accurately identifying languages and interpreting emotional cues. Their expertise can provide valuable insights, feedback, and validation, ensuring that the model's predictions align with established linguistic principles and acoustic patterns. The collaboration with sound experts would also contribute to the ongoing development and refinement of the language classification model, further enhancing its performance and reliability.

Back to Table of Contents


References


In this section, you will find the references that were used to support the information presented in this study. These references include academic papers, datasets, standards, and other sources that were deemed relevant to the topic at hand. The references follow the guidelines set out by the APA (American Psychological Association) style of referencing.

[1] Chouhan, D. (2023, May 12). Lang_data. [Dataset]. Kaggle. Retrieved from https://www.kaggle.com/datasets/shadowfax/lang-data

[2] Stalder, S., Perraudin, N., Achanta, R., Perez-Cruz, F., & Volpi, M. (2022). What You See is What You Classify: Black Box Attributions. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2022/file/0073cc73e1873b35345209b50a3dab66-Paper-Conference.pdf

[3] Wang, Z. [UN ESCAP]. (2023, April 18). Multilingualism at the UN: Linguistic Diversity in the Asia-Pacific Region [Video file]. Retrieved from https://www.youtube.com/watch?v=221A6yWDRbE

[4] International Organization for Standardization. (2002). ISO 639-1:2002, Codes for the representation of names of languages — Part 1: Alpha-2 code. Retrieved from https://www.iso.org/standard/22109.html

[5] UN Economic and Social Commission for Asia and the Pacific. (n.d.). One UN, many voices: Why multilingualism matters. Retrieved from https://www.unescap.org/story/one-un-many-voices-why-multilingualism-matters

[6] Stalder, S. (n.d.). NN-Explainer. Retrieved from https://github.com/stevenstalder/NN-Explainer


Back to Table of Contents